PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test: rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school. Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.
Out of the many available variables, I selected 24 columns, later reduced that number according to the requirements of my queries, and cleaned the dataset for faster execution.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
Let's load in PISA dataset and describe its properties through the questions below.
df = pd.read_csv('pisa2012.csv',encoding='ISO-8859-1')
df.info()
df.shape
From the wrangling steps above, we see that the PISA data is very large: the main dataset contains 485,490 rows and 636 columns. To become familiar with the variables, we can look into a companion file, pisadict2012.csv.
variables_pisa = pd.read_csv('pisadict2012.csv',encoding='ISO-8859-1')
variables_pisa
We still cannot take in all the variables at once, so my next step is to open the CSV file in a text editor (Atom) and go through the variable descriptions there. My primary goal is to identify the variables related to student performance in the assessed subjects, i.e. mathematics, science and reading. Before spending longer searching for features of interest, it is worth reading the test design and scaling sections of the PISA 2012 documentation, which clarify why the plausible values in mathematics, science and reading (and the math content subscales) are valid scores for analysing students' performance.
The dataset records the performance of 485,490 students across 636 features, 268 of which are numeric. Given the complexity of the dataset, I am mostly interested in pupil performance in mathematics, reading and science, which is coded as plausible values.
You can find them in the columns PV1MATH-PV5MATH (math), PV1READ-PV5READ (reading) and PV1SCIE-PV5SCIE (science). For a given subject, the five values PV1-PV5 are independent estimates of the student's performance in that subject.
The rest of the features are object type.
There are way too many features to consider. I am curious about the following three questions.
- Are there differences in achievement based on teacher practices and attitudes?
- Are there differences in achievement based on gender, location, or student attitudes?
- Does socio-economic status matter?
I decided to explore the performance-related dependent variables: the plausible values in mathematics, science and reading. We will first explore features one at a time, then move to bivariate and finally multivariate exploration.
My expectation is that the plausible values in math, science and reading are strongly related to the resources available to individual students. One of the prime resources is the teacher at their school, so the mentor-student relationship and teaching style are a good starting point for answering my first question.
PV1MATH-PV5MATH, PV1READ-PV5READ and PV1SCIE-PV5SCIE (the plausible values in each category) are the dependent variables.
The following variables can be considered direct measures of teachers' attitudes, practices and approaches to teaching. For the sake of simplicity, I will consider:
- TCHBEHFA: Teacher Behaviour: Formative Assessment
- STUDREL: Teacher Behaviour: Teacher-Student Relations
- TEACHSUP: Teacher Support

To answer the second question regarding location, gender and students' attitudes, the following variables are important:
- OUTHOURS: Out-of-School Study Time
- CNT: Location (country)
- ST04Q01: Gender of Student
- SMINS / LMINS / MMINS: Learning Time in Science / Language / Math

Finally, ESCS (Index of economic, social and cultural status) can shed some light on students' socio-economic status and its relation to achievement.
#Only necessary columns for our analysis are kept.
df = df[['CNT','ST04Q01','OUTHOURS','PV1MATH','PV2MATH',
'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ',
'PV3READ', 'PV4READ', 'PV5READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE',
'PV4SCIE', 'PV5SCIE', 'TCHBEHFA','STUDREL','ESCS','TEACHSUP','SMINS','LMINS','MMINS']]
df.info()
df.head()
#view missing values
df.isnull().sum()
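Before dropping rows wholesale, it can help to see what fraction of each column is missing rather than raw counts. A minimal sketch of that check on toy data (the column values here are made up, not actual PISA rows):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the selected PISA columns
toy = pd.DataFrame({
    'OUTHOURS': [10.0, np.nan, 5.0, np.nan],
    'ESCS': [0.5, -1.2, np.nan, 0.3],
})

# Percent missing per column, highest first; isnull().mean() gives the fraction of NaNs
pct_missing = toy.isnull().mean().sort_values(ascending=False) * 100
print(pct_missing)  # OUTHOURS 50.0, ESCS 25.0
```

Columns with a very high missing share might deserve imputation or exclusion instead of row-wise dropping.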
#Drop duplicate and null values
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)
df.info()
# Find the mean score for all subjects and drop the unnecessary columns
df['Avg Math Score'] = (df['PV1MATH'] + df['PV2MATH'] + df['PV3MATH'] + df['PV4MATH'] + df['PV5MATH']) / 5
df['Avg Reading Score'] = (df['PV1READ'] + df['PV2READ'] + df['PV3READ'] + df['PV4READ'] + df['PV5READ']) / 5
df['Avg Science Score'] = (df['PV1SCIE'] + df['PV2SCIE'] + df['PV3SCIE'] + df['PV4SCIE'] + df['PV5SCIE']) / 5
df.drop(columns=['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ',
'PV5READ', 'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'], inplace=True)
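The same averages can be computed more compactly with a row-wise mean over the five plausible-value columns; a small sketch on toy data (column names follow the PISA coding, the values are invented):

```python
import pandas as pd

# Toy frame with two students and the five math plausible values
toy = pd.DataFrame({
    'PV1MATH': [400.0, 500.0],
    'PV2MATH': [410.0, 510.0],
    'PV3MATH': [420.0, 520.0],
    'PV4MATH': [430.0, 530.0],
    'PV5MATH': [440.0, 540.0],
})

# A row-wise mean over the five PV columns matches the sum-and-divide-by-5 approach
toy['Avg Math Score'] = toy[[f'PV{i}MATH' for i in range(1, 6)]].mean(axis=1)
print(toy['Avg Math Score'].tolist())  # [420.0, 520.0]
```

Using `.mean(axis=1)` also generalizes cleanly if a different number of plausible values were retained.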
#Rename the columns
df.rename({'CNT': 'Country', 'ST04Q01': 'Gender',
'STUDREL': 'Teacher Student Relations', 'TCHBEHFA': 'Formative Assesment',
'OUTHOURS': 'Out_of_School Study Time',
'Avg Science Score': 'Average Science Score',
'Avg Math Score': 'Average Math Score',
'Avg Reading Score': 'Average Reading Score',
'ESCS': 'Socio_economic_cultural Status',
'TEACHSUP':'Teacher Support','SMINS':'Learning Time Science','LMINS':'Learning Time Language','MMINS':'Learning Time Math'}, axis='columns', inplace=True)
df.info()
df.head()
I'll start by looking at the distributions of the main variables of interest; the mean score in each subject is the best place to start. For the first question, let's plot histograms of students' average math, science and reading scores. In this section I investigate the distributions of individual variables.
fig, ax = plt.subplots(ncols=3, figsize = [20,5])
binsize = 50
variables = ['Average Math Score', 'Average Science Score', 'Average Reading Score']
for i in range(len(variables)):
var = variables[i]
bins = np.arange(0, max(df[var])+binsize, binsize)
ax[i].hist(data = df, x = var, bins = bins)
ax[i].set_xlabel('Score')
ax[i].set_ylabel('Count')
ax[i].set_title('{}'.format(var))
plt.suptitle('Distribution of Average Scores in Math, Science and Reading', fontsize =14, weight ='bold')
plt.show();
From the three plots above, the math score distribution looks approximately normal, while the science and reading scores are slightly left-skewed. A boxplot gives a clearer picture of the medians and percentiles, so let's create one.
#create boxplot for all three variables
fig, ax = plt.subplots(nrows=3, figsize = [7,12])
variables = ['Average Math Score', 'Average Science Score', 'Average Reading Score']
for i in range(len(variables)):
var = variables[i]
ax[i].boxplot(data = df, x = var)
ax[i].set_xlabel('{}'.format(var))
Since the distributions are close to normal, we can move on to the study-time variables. First we consider out-of-school study hours. The index OUTHOURS was computed by summing the time spent studying for school subjects across its components: homework, guided homework, personal tutoring, commercial-company classes, study with a parent, and study with a computer.
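That summation can be sketched in code. The component column names used below (ST57Q01-ST57Q06) are my assumption about the 2012 questionnaire coding and should be verified against pisadict2012.csv; the values are invented:

```python
import pandas as pd

# Hypothetical component columns: homework, guided homework, personal tutor,
# commercial company, with parent, computer (names assumed, verify in the codebook)
components = ['ST57Q01', 'ST57Q02', 'ST57Q03', 'ST57Q04', 'ST57Q05', 'ST57Q06']
toy = pd.DataFrame([[5, 2, 0, 0, 1, 2],
                    [8, 0, 1, 0, 0, 3]], columns=components)

# An OUTHOURS-style index: total weekly out-of-school study hours per student
toy['OUTHOURS'] = toy[components].sum(axis=1)
print(toy['OUTHOURS'].tolist())  # [10, 12]
```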
binsize = 5
bins = np.arange(0, df['Out_of_School Study Time'].max()+binsize, binsize)
plt.figure(figsize=[10,7])
plt.hist(data = df, x = 'Out_of_School Study Time', bins = bins)
plt.xlabel('Study Hours Outside School/week')
plt.ylabel('Frequency of hours spent/week')
plt.show()
This distribution is highly right-skewed, so looking at summary statistics would help before commenting further. Beyond 35 hours the frequency drops sharply, yet the majority of students in our dataset do put in effort outside school. Let's apply a log transformation to the hours.
# there's a long tail in the distribution, so let's put it on a log scale instead
log_binsize = 0.1
bins = 10 ** np.arange(0, np.log10(df['Out_of_School Study Time'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = df, x = 'Out_of_School Study Time', bins = bins)
plt.xscale('log')
plt.xticks([1,3,10,30,100,300])
plt.xlabel('Hour')
plt.ylabel('Count')
plt.title('Distribution of outside school study hours')
plt.show()
df['Out_of_School Study Time'].describe()
Now we can study the learning time spent on each subject in school. Learning time in the test language (LMINS) was computed by multiplying the average number of minutes per test-language class period by the number of test-language class periods per week (items ST69 and ST70). Comparable indices were computed for mathematics (MMINS) and science (SMINS).
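The computation itself is simple: minutes per class period times periods per week. A sketch with made-up values (generic column names, not the actual ST69/ST70 item codes):

```python
import pandas as pd

# Made-up per-student values: period length in minutes and periods per week
toy = pd.DataFrame({'minutes_per_period': [45, 60],
                    'periods_per_week': [4, 5]})

# An LMINS-style index: weekly learning minutes in the test language
toy['LMINS'] = toy['minutes_per_period'] * toy['periods_per_week']
print(toy['LMINS'].tolist())  # [180, 300]
```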
fig, ax = plt.subplots(nrows=3, figsize = [10,15])
binsize = 50
variables = ['Learning Time Math','Learning Time Language','Learning Time Science']
for i in range(len(variables)):
var = variables[i]
bins = np.arange(0, df[var].max()+binsize, binsize)
ax[i].hist(data = df, x = var, bins = bins)
ax[i].set_xlabel('Minutes')
ax[i].set_ylabel('Count')
ax[i].set_title('{}'.format(var))
ax[i].set_xlim(0, 1000)
plt.show()
The math and language distributions match a unimodal, roughly normal shape, while science shows a clear right skew. Since all of the learning-time variables have values beyond 600 minutes, and these values might distort later plots, we should examine them and decide whether it makes sense to disregard them.
# Select high outliers for the learning time total, using criteria eyeballed from the plot
high_outliers_math = (df['Learning Time Math'] > 600)
print(high_outliers_math.sum())
print(df.loc[high_outliers_math,:])
Similarly, for the language and science learning times:
high_outliers_language = (df['Learning Time Language'] > 600)
print(high_outliers_language.sum())
print(df.loc[high_outliers_language,:])
high_outliers_science = (df['Learning Time Science'] > 600)
print(high_outliers_science.sum())
print(df.loc[high_outliers_science,:])
# Remove outliers
df = df.loc[~high_outliers_math & ~high_outliers_language & ~high_outliers_science,:]  # use ~ (not -) to negate boolean masks
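The 600-minute cutoff above is eyeballed from the plots; a more systematic alternative would be an IQR rule, where the upper fence is Q3 + 1.5 * IQR. A sketch on toy minutes data:

```python
import pandas as pd

# Toy weekly learning-time values with one extreme observation
minutes = pd.Series([200, 220, 240, 250, 260, 270, 300, 1200])

# Upper fence from the interquartile range
q1, q3 = minutes.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
kept = minutes[minutes <= upper_fence]
print(upper_fence, kept.tolist())  # 341.25, the 1200 value is dropped
```

On the real data this would yield a data-driven threshold per variable instead of a single fixed one.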
# Re-plotting the distributions of Learning Times
fig, ax = plt.subplots(nrows=3, figsize = [10,20])
variables = ['Learning Time Math', 'Learning Time Language', 'Learning Time Science']
for i in range(len(variables)):
var = variables[i]
ax[i].hist(data = df, x = var)
ax[i].set_xlabel('{} (mins/week)'.format(var))
ax[i].set_ylabel('Frequency')
ax[i].set_title('{}'.format(var))
plt.show()
We still have location, gender, and teachers' practices and attitudes (Formative Assesment, Teacher Student Relations, Teacher Support), plus Socio_economic_cultural Status, to analyse.
# Find number of students from different countries
plt.figure(figsize = [14.7,10.27])
base_color = sb.color_palette()[0]
sb.countplot(data = df, y = 'Country', color = base_color, order=df['Country'].value_counts().index,orient="h")
plt.title('Number of students based on their countries')
plt.xticks(rotation='vertical');
plt.xlabel('Number of students');
df.Country.value_counts()
From the plot above, it is clear that some countries appear under several names: Massachusetts (USA), Florida (USA), Connecticut (USA) and United States of America are all the USA; China-Shanghai, Hong Kong-China and Macao-China are regions of China; and Chinese Taipei is Taiwan. It would be better to visualize each as a single country, so I will convert them to the appropriate country name.
# this will replace the incorrect names with correct one
df.replace({'Country': {'Connecticut (USA)': 'USA', 'Florida (USA)': 'USA','Massachusetts (USA)': 'USA','United States of America': 'USA',
'Hong Kong-China':'China', 'China-Shanghai':'China', 'Macao-China':'China','Macao':'China','Hong Kong':'China','Czech Republic':'Czech',
'Czechia':'Czech','Korea, Republic of':'Korea','United States':'USA','Chinese Taipei':'Taiwan'}},inplace = True)
#replace_names_with_correct_ones()
df.Country.value_counts()
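This kind of consolidation is easy to sanity-check on a small series: after `replace`, none of the pre-merge regional names should remain. A sketch with a subset of the mapping:

```python
import pandas as pd

# Subset of the name-consolidation mapping used above
mapping = {'Florida (USA)': 'USA', 'Massachusetts (USA)': 'USA',
           'Hong Kong-China': 'China', 'China-Shanghai': 'China'}
countries = pd.Series(['Florida (USA)', 'Hong Kong-China', 'Japan', 'China-Shanghai'])

merged = countries.replace(mapping)
print(merged.tolist())  # ['USA', 'China', 'Japan', 'China']
# No pre-merge regional names should survive the replacement
print(merged.isin(list(mapping)).any())  # False
```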
# Find number of students from different countries
plt.figure(figsize = [14.7,10.27])
base_color = sb.color_palette()[0]
sb.countplot(data = df, y = 'Country', color = base_color, order=df['Country'].value_counts().index,orient="h")
plt.suptitle('Number of students based on their countries',fontsize = 14, weight ='bold')
plt.xticks(rotation='vertical');
plt.xlabel('Number of students');
df[['Teacher Support','Formative Assesment','Teacher Student Relations']].describe()
fig, ax = plt.subplots(nrows=3, figsize = [10,15])
binsize = 0.5
variables = ['Formative Assesment', 'Teacher Support', 'Teacher Student Relations']
for i in range(len(variables)):
var = variables[i]
bins = np.arange(min(df[var]), max(df[var])+binsize, binsize)
ax[i].hist(data = df, x = var, bins = bins)
ax[i].set_xlabel('Teacher Attitude')
ax[i].set_ylabel('Count')
ax[i].set_title('{}'.format(var))
The Teacher Support and Formative Assesment distributions are shifted to the left with two modes. For formative assessment, most students received at least some attention in some lessons. For teacher support, many students got some support in some lessons, but a similar count reported no support at all. The teacher-student relations distribution looks roughly normal, with most answers falling in the strongly-disagree-to-disagree zone. A violin plot may shed more light in a later section.
Let's look into the socio-economic and cultural status of the students now. ESCS is used in many PISA reports and analyses, both as a control for the socio-economic status of students and schools, and in bivariate correlations with performance as one of the main indicators of equity in an education system.
binsize = 0.2
bins = np.arange(df['Socio_economic_cultural Status'].min(), df['Socio_economic_cultural Status'].max()+binsize, binsize)
plt.figure(figsize=[10,7])
plt.hist(data = df, x = 'Socio_economic_cultural Status', bins = bins)
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Socio_economic_cultural Status Frequency')
plt.show()
The distribution is slightly skewed to the left.
df['Socio_economic_cultural Status'].describe()
plt.figure(figsize = [5,5])
base_color = sb.color_palette()[0]
sb.countplot(data = df, x = 'Gender', color = base_color, order=df['Gender'].value_counts().index)
plt.title('Number of students based on gender')
plt.xticks(rotation='vertical');
plt.ylabel('Number of students');
The average score distributions were strikingly normal. This was expected to an extent, since student grades typically fall along a bell curve. As a result, no unusual points stood out for the three average scores, and no transformations were necessary to make sense of them.
There were more female students than male students in the sample.
The secondary features investigated were Study Times, Learning Times, Socio_economic_cultural Status, Teacher Support, Formative Assesment, Teacher Student Relations, Gender of student and Country.
For study times, the total had a strong right skew. To better understand this feature, we spread it across a logarithmic scale to check whether it was in fact multimodal or showed other irregularities. In the end, it turned out to be unimodal and fairly normal on the log scale.
As for the learning times, this data clearly had outliers, so for each of the learning times the values over 600 minutes were excluded. This focuses the analysis on more typical students, and keeps later plots from being distorted by these exceptionally dedicated few.
In this section, I investigate relationships between pairs of variables in my data, starting with a correlation heatmap.
numeric_vars = ['Average Math Score', 'Average Reading Score', 'Average Science Score', 'Out_of_School Study Time','Learning Time Math','Learning Time Language','Learning Time Science','Socio_economic_cultural Status','Teacher Student Relations','Teacher Support','Formative Assesment']
categorical_vars = ['Country','Gender']
# Correlation plot
plt.figure(figsize = [12,10])
heatmap = sb.heatmap(df[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12);
plt.show()
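Beyond eyeballing the heatmap, the strongest pairwise correlations can be pulled out programmatically by keeping only the upper triangle of the correlation matrix and ranking the pairs. A minimal sketch on synthetic data (toy column names, not the PISA variables):

```python
import numpy as np
import pandas as pd

# Synthetic data: 'read' is strongly correlated with 'math', 'noise' is not
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({'math': x,
                    'read': x * 0.9 + rng.normal(scale=0.3, size=200),
                    'noise': rng.normal(size=200)})

corr = toy.corr()
# Keep the upper triangle (excluding the diagonal), then rank the pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head(1))  # the math/read pair should dominate
```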
# plot matrix: sample 500 students so that plots are clearer and
# they render faster
samples = np.random.choice(df.shape[0], 500, replace = False)
df_samp = df.iloc[samples,:]  # use iloc: the index is no longer 0..n-1 after dropping rows
g = sb.PairGrid(data = df_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter);
Now I will look into gender profile and its effect on performance.
# Find average score based on gender
# select the score columns before aggregating, so non-numeric columns are not averaged
gender_group = df.groupby('Gender')[['Average Math Score', 'Average Reading Score', 'Average Science Score']].mean()
ax = gender_group.plot.bar(figsize=(12,6));
plt.title('Average scores of students based on their gender.')
plt.ylabel('Average Score')
plt.xticks(rotation='horizontal');
Although the average scores are quite close for males and females, females have a noticeably higher average reading score than male students.
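The reading gap can also be computed directly rather than read off the bars; a sketch on toy data (the gender labels mirror the recoded column, the scores are invented):

```python
import pandas as pd

# Toy scores for two female and two male students
toy = pd.DataFrame({'Gender': ['Female', 'Male', 'Female', 'Male'],
                    'Average Reading Score': [520.0, 480.0, 510.0, 470.0]})

# Mean reading score per gender, then the female-minus-male gap
means = toy.groupby('Gender')['Average Reading Score'].mean()
gap = means['Female'] - means['Male']
print(gap)  # 40.0
```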
Let's check with average scores individually per country.
plt.figure(figsize = [14.27, 10.27])
plt.subplots_adjust(wspace = 0.85) # adjust spacing between subplots, in order to show long country names nicely
plt.suptitle("Average Score Based On Different Countries", y = 1.04,fontsize = 14, weight = "bold")
#plt.tight_layout();
math_score_country_order = df.groupby('Country')['Average Math Score'].mean().sort_values(ascending = False).index[:20]
reading_score_country_order = df.groupby('Country')['Average Reading Score'].mean().sort_values(ascending = False).index[:20]
science_score_country_order = df.groupby('Country')['Average Science Score'].mean().sort_values(ascending = False).index[:20]
plt.subplot(1, 3, 1)
sb.boxplot(y = df['Country'],x = df['Average Math Score'], order = math_score_country_order, color = sb.color_palette()[1]);
plt.ylabel('Countries (ordered descendingly by score ranking)')
plt.title('Math score distributions by country');
plt.subplot(1, 3, 2)
sb.boxplot( y = df['Country'],x = df['Average Reading Score'], order = reading_score_country_order, color = sb.color_palette()[1]);
plt.ylabel(''); # Remove the redundant label
plt.title('Reading score distributions by country');
plt.subplot(1, 3, 3)
sb.boxplot( y = df['Country'],x = df['Average Science Score'], order = science_score_country_order, color = sb.color_palette()[1]);
plt.ylabel(''); # Remove the redundant label
plt.title('Science score distributions by country');
From the plots of average math, science and reading scores by location, East Asian countries dominate. China maintains the highest average score throughout; Singapore has the second-highest average math score, while Japan is second for science and reading with Singapore third.
Germany, Taiwan, Korea, Belgium, Finland, Greece, Poland, the Netherlands, Switzerland, Liechtenstein and Ireland also appear among the top-ranked countries.
Indonesia, Peru, Qatar, Brazil and Argentina remain in the bottom five.
# Find out teachers' attitude and practices based on location
plt.figure(figsize = [14.27, 10.27])
plt.subplots_adjust(wspace = 0.85) # adjust spacing between subplots, in order to show long country names nicely
plt.suptitle("Teacher Support, Formative Assesment and Teacher Student Relations by Country", y = 1.04,fontsize = 14, weight = "bold")
teacher_support_country_order = df.groupby('Country')['Teacher Support'].mean().sort_values(ascending = True).index[:20]
teacher_student_rel_country_order = df.groupby('Country')['Teacher Student Relations'].mean().sort_values(ascending = True).index[:20]
formative_assesment_country_order = df.groupby('Country')['Formative Assesment'].mean().sort_values(ascending = True).index[:20]
plt.subplot(1, 3, 1)
sb.boxplot(x = df['Teacher Support'], y = df['Country'], order = teacher_support_country_order, color = sb.color_palette()[1]);
plt.ylabel('Countries (lowest 20 by mean support, ascending)')
plt.title('Teacher support distributions by country');
plt.subplot(1, 3, 2)
sb.boxplot(x = df['Teacher Student Relations'], y = df['Country'], order = teacher_student_rel_country_order, color = sb.color_palette()[1]);
plt.ylabel(''); # Remove the redundant label
plt.title('Teacher Student Relations distributions by country');
plt.subplot(1, 3, 3)
sb.boxplot(x = df['Formative Assesment'], y = df['Country'], order = formative_assesment_country_order, color = sb.color_palette()[1]);
plt.ylabel(''); # Remove the redundant label
plt.title('Formative assesment distributions by country');
So teachers' attitudes and practices do not really translate into better scores. From the plot above, we see that the low-scoring countries report higher teacher support (above 0), higher teacher-student relations (around 1) and formative assessment (between 0 and 1). This finding aligns with our earlier correlation matrix plot, where all of the teachers'-practice variables had negative coefficients with the average scores.
# Effect of socio-economic-cultural status on students' achievement
socio_econo_cultural_country_order = df.groupby('Country')['Socio_economic_cultural Status'].mean().sort_values(ascending = False).index[:20]
plt.figure(figsize = [14.27, 10.27])
sb.violinplot(x = df['Socio_economic_cultural Status'], y = df['Country'], order = socio_econo_cultural_country_order, color = sb.color_palette()[2]);
plt.ylabel('Country');
plt.title('Socio-economic and cultural status distributions by country',fontsize = 14, weight = "bold");
The bivariate exploration further confirms that there is no strong relationship between teachers' attitudes and practices and score. Countries that scored lower actually reported more help from teachers than high-scoring countries.
Learning time and outside-school study hours also show no strong relationship with average score. Analysis in a later section may provide an explanation.
Regarding socio-economic status, European countries, North America and Australia, along with a few Asian economies such as Japan, Qatar, Korea and Israel, are well-off. Interestingly, China is not among them even though it has the highest average scores in math, science and reading.
Here we explore three variables at a time to find insight into the relationships among them, focusing on the top ten countries by average score. We then examine how geographic location and three other major variables (outside-school study time, socio-economic-cultural status, and teachers' practices and attitudes) influence students' achievement.
df.head()
df.describe()
# Find out average math score vs socio economic cultural status for different countries
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 7, height = 3)  # 'size' was renamed 'height' in seaborn
g.map(plt.scatter,'Socio_economic_cultural Status','Average Math Score',alpha = 1/10)
g.set(yscale = 'log') # need to set scaling before customizing ticks
y_ticks = [100, 300, 1000]
g.set(yticks = y_ticks, yticklabels = y_ticks)
plt.suptitle("Global Profile of Average Math Score against Socio_economic_cultural Status", y = 1.04, fontsize = 14,weight = 'bold');
# Find out average science score vs socio economic cultural status for different countries
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 7, height = 3)
#g.map(sb.regplot,'Socio_economic_cultural Status','Average Science Score')
g.map(plt.scatter, 'Socio_economic_cultural Status','Average Science Score',alpha = 1/10)
g.set_xlabels('Socio_economic_cultural Status')
g.set_ylabels('Average Science Score')
plt.suptitle("Global Profile of Average Science Score against Socio_economic_cultural Status", y = 1.04, fontsize = 14,weight = 'bold');
## Find out average reading score vs socio economic cultural status for different countries
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 7, height = 3)
g.map(plt.scatter,'Socio_economic_cultural Status','Average Reading Score',alpha = 1/10)
g.set_xlabels('Socio_economic_cultural Status')
g.set_ylabels('Average Reading Score')
plt.suptitle("Global Profile of Average Reading Score against Socio_economic_cultural Status", y = 1.04, fontsize = 14,weight = 'bold');
def hist2dgrid(x, y, **kwargs):
""" Quick hack for creating heat maps with seaborn's PairGrid. """
palette = kwargs.pop('color')
bins_y = np.arange(100, 800+50,100)
bins_x = np.arange(-6, 4+0.5, 0.5)
plt.hist2d(x, y, bins = [bins_x, bins_y], cmap = palette, cmin = 0.5)
# create faceted heat maps of reading score vs socio-economic status, one facet per country
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 3, height = 4)
g.map(hist2dgrid, 'Socio_economic_cultural Status','Average Reading Score', color = 'inferno_r')
g.set_xlabels('Socio_economic_cultural Status')
g.set_ylabels('Average Reading Score')
plt.suptitle("Global Profile of Average Reading Score against Socio_economic_cultural Status", y = 1.04, fontsize = 14,weight = 'bold')
plt.colorbar()
plt.show();
# create faceted heat maps of science score vs socio-economic status, one facet per country
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 3, height = 4)
g.map(hist2dgrid, 'Socio_economic_cultural Status','Average Science Score', color = 'inferno_r')
g.set_xlabels('Socio_economic_cultural Status')
g.set_ylabels('Average Science Score')
plt.colorbar()
plt.suptitle("Global Profile of Average Science Score against Socio_economic_cultural Status", y = 1.04, fontsize = 14,weight = 'bold')
plt.show()
# create faceted heat maps of math score vs socio-economic status, one facet per country
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 3, height = 4)
g.map(hist2dgrid, 'Socio_economic_cultural Status','Average Math Score', color = 'inferno_r')
g.set_xlabels('ESCS')
g.set_ylabels('Score')
plt.colorbar()
plt.suptitle("Global Profile of Average Math Score against Socio_economic_cultural Status", y = 1.04, fontsize = 14,weight = 'bold')
plt.show()
Next step of exploration is to find out study hours outside school and its impact on average score by country.
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 7, height = 3)
g.map(plt.scatter,'Out_of_School Study Time','Average Math Score',alpha = 1/10)
g.set_xlabels('Outside school study hour')
plt.suptitle("Global Profile of Average Math Score against Outside School Study Hour", y = 1.04, fontsize = 14,weight = 'bold')
g.set_ylabels('Score');
Maybe a log transformation of both axes can display the relationship better, since most of the data points are cluttered between 0 and 50 hours.
# Log transformation of both axes
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 7, height = 3)
g.map(plt.scatter,'Out_of_School Study Time','Average Math Score',alpha = 1/10)
g.set(xscale = 'log') # need to set scaling before customizing ticks
g.set(yscale = 'log')
y_ticks = [100,300,500,1000]
x_ticks = [1,2,10,20,50,100]
g.set(xticks = x_ticks, xticklabels = x_ticks,yticks = y_ticks,yticklabels = y_ticks )
plt.suptitle("Global Profile of Average Math Score against Outside School Study Hour", y = 1.04, fontsize = 14,weight = 'bold');
Again, the log transformation of average math scores against outside-school study hours does not reveal any strong relationship. Across countries, average math scores span a similar range whether students spent long or short hours.
For simplicity of visualization, let's show only the top five countries.
#redefine dataset with our choice of variables
top_five_countries = ['China','Japan','Singapore','Taiwan','Korea']
df_new = df[['Country', 'Average Math Score', 'Average Reading Score', 'Average Science Score','Socio_economic_cultural Status','Out_of_School Study Time','Formative Assesment','Teacher Support']][df['Country'].isin(top_five_countries)]
# Log transformation of both axes
g = sb.FacetGrid(data = df_new, col = 'Country', col_wrap = 3, height = 4)
g.map(plt.scatter,'Out_of_School Study Time','Average Math Score',alpha = 1/10)
g.set(xscale = 'log') # need to set scaling before customizing ticks
x_ticks = [1,2,10,20,50,100]
g.set(xticks = x_ticks, xticklabels = x_ticks )
plt.suptitle("Global Profile of Average Math Score against Out_of_School Study Time", y = 1.04, fontsize = 14,weight = 'bold');
Let's finish the analysis with teachers' attitudes and practices.
# Find teachers' attitude and their impact on average math score
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 7, height = 3)
g.map(plt.scatter,'Formative Assesment','Average Math Score',alpha = 1/10) #Formative assesment
g.set(yscale = 'log')
y_ticks = [100,300,500,1000]
#x_ticks = [0.01,0.03,0.1,0.3,0.6,1,3]
g.set(yticks = y_ticks,yticklabels = y_ticks )
plt.suptitle("Global Profile of Average Math Score against Formative Assesment", y = 1.04, fontsize = 14,weight = 'bold');
It looks like Formative Assesment does not impact performance either: all ranges of formative assessment achieve roughly similar scores across the countries.
#Find the impact of teacher support on average math score
g = sb.FacetGrid(data = df, col = 'Country', col_wrap = 7, height = 3)
g.map(plt.scatter,'Teacher Support','Average Math Score',alpha = 1/10) # Teacher support
#g.set(xscale = 'log') # need to set scaling before customizing ticks
g.set(yscale = 'log')
y_ticks = [100,300,500,1000]
#x_ticks = [1,2,10,20,50,100]
g.set(yticks = y_ticks,yticklabels = y_ticks )
plt.suptitle("Global Profile of Average Math Score against Teacher Support", y = 1.04, fontsize = 14,weight = 'bold');
For simplicity of visualization, let's show only the top five countries.
#Find the impact of teacher support on average math score
g = sb.FacetGrid(data = df_new, col = 'Country', col_wrap = 3, height = 4)
g.map(plt.scatter,'Teacher Support','Average Math Score',alpha = 1/10) # Teacher support
y_ticks = [300,600,1000]
g.set(yscale = 'log')
g.set(yticks = y_ticks,yticklabels = y_ticks )
plt.suptitle("Global Profile of Average Math Score against Teacher Support", y = 1.04, fontsize = 14,weight = 'bold');
For simplicity of visualization, let's show only the top five countries.
# create faceted heat maps of math score vs socio-economic status for the top five countries
g = sb.FacetGrid(data = df_new, col = 'Country', col_wrap = 3, height = 5)
g.map(hist2dgrid, 'Socio_economic_cultural Status','Average Math Score', color = 'inferno_r')
g.set_xlabels('Socio_economic_cultural Status')
g.set_ylabels('Average Math Score')
plt.colorbar()
plt.suptitle("Global Profile of Average Math Score against Socio_economic_cultural Status", y = 1.04, fontsize = 14,weight = 'bold')
plt.show();
In this part, it is visible that, irrespective of country, better socio-economic-cultural status resulted in a higher math score. Although we applied a log transformation to observe the relationship more clearly, the observation remained the same, and the science and reading scores follow a similar pattern. The heatmaps clarify how the high scorers are associated with socio-economic-cultural status in every country. Although the counts above 700 are small, top-scoring countries like China, Japan and Taiwan, along with Australia and the European countries, hold the trend.
As explored in previous sections, outside-school study time shows no visible impact on performance, which is genuinely surprising.